Spring R ’24
Dominic Bordelon, Research Data Librarian
University Library System, University of Pittsburgh
dbordelon@pitt.edu
Services for the Pitt community:
Support areas and interests:
- {rsample}, part of {tidymodels}, for train/test splits
- {recipes}, part of {tidymodels}, for preprocessing
- {naivebayes}
- {rpart} and {rpart.plot} for decision trees
- {kknn} for K Nearest Neighbors
- {tidyclust}, part of {tidymodels}, for K-Means clustering

💡 In terms of writing code, there are a variety of approaches to modeling in R, even for fitting the same type of model (e.g., when implemented by different package developers). We will favor the tidymodels approach.
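As a concrete starting point, here is a minimal base-R sketch of the 80/20 train/test split that rsample::initial_split() performs (the iris dataset and the 0.8 proportion are chosen here purely for illustration):

```r
# Base-R illustration of a train/test split; with {rsample} loaded you
# would instead write:
#   split <- initial_split(my_data, prop = 0.8)
#   train <- training(split); test <- testing(split)
set.seed(42)                                   # reproducible split
n <- nrow(iris)                                # iris ships with R
train_idx <- sample(n, size = floor(0.8 * n))  # 80% of row indices
iris_train <- iris[train_idx, ]
iris_test  <- iris[-train_idx, ]
nrow(iris_train)  # 120
nrow(iris_test)   # 30
```

The model is fit only on the training rows; the held-out test rows give an honest estimate of performance on unseen data.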
Machine learning often describes statistical concepts with different language, due to separate disciplinary traditions.
| Statistics term | ML / Computer Science term |
|---|---|
| observation, case | example, instance |
| response variable, dependent variable | label, output |
| predictor, independent variable | feature, input |
| regression | regression, supervised learner, machine |
| estimation | learning |
⚠ Terms/concepts to be careful with in ML, coming from stats:
- hypothesis (sometimes an output of a classifier model)
- bias (broader meaning in ML)
- causality (sometimes treated less rigorously than in stats)
Regression
Classification
Animation of the naive Bayes classifier. Color intensity indicates probability of group membership. Image source: Jacopo Bertolotti via Wikimedia Commons (CC0)
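To make the idea behind the animation concrete, here is a toy base-R sketch of Bayes' rule with a single feature; the priors and likelihoods below are invented numbers for illustration ({naivebayes} estimates them from data):

```r
# Toy naive Bayes by hand: predict class (A or B) from one feature value.
# All probabilities below are invented for illustration.
p_A <- 0.5; p_B <- 0.5            # P(class) priors
p_long_given_A <- 0.8             # P(feature = "long" | class A)
p_long_given_B <- 0.3             # P(feature = "long" | class B)

# Unnormalized posteriors via Bayes' rule, then normalize:
post <- c(A = p_long_given_A * p_A, B = p_long_given_B * p_B)
post <- post / sum(post)
round(post, 3)                    # A: 0.727, B: 0.273
```

Given the observation "long", class A has the higher posterior probability, so the classifier assigns A; the color intensity in the animation corresponds to these posterior probabilities.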
Animation of a simple decision tree example. Each binary branch in the tree on the left corresponds to a partitioning in the x-y space. The response variable (output) of this model is gray/green color classification. Image source: Algobeans
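The same idea can be tried directly with {rpart}, which ships with standard R installations; iris is used here for illustration:

```r
library(rpart)  # ships with standard R installations

# Fit a small classification tree: species predicted from petal measurements
fit <- rpart(Species ~ Petal.Length + Petal.Width, data = iris,
             method = "class")
# Each printed node corresponds to one binary partition of the feature space
print(fit)
# With {rpart.plot} installed, rpart.plot::rpart.plot(fit) draws the tree
```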
Image source: James et al. 2021
K Nearest Neighbors
Effects of changing K.
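The effect of K can be seen by refitting with different values. A sketch using class::knn() (the {class} package ships with R) as a stand-in for {kknn}, on iris for illustration:

```r
library(class)  # ships with R; provides knn()

set.seed(1)
idx   <- sample(150, 100)
train <- scale(iris[idx, 1:4])          # center and scale training features
test  <- scale(iris[-idx, 1:4],         # scale test set with TRAIN statistics
               center = attr(train, "scaled:center"),
               scale  = attr(train, "scaled:scale"))
cl <- iris$Species[idx]

# Small K -> flexible, jagged decision boundary; large K -> smoother,
# possibly underfit boundary
for (k in c(1, 5, 25)) {
  pred <- knn(train, test, cl, k = k)
  cat("k =", k, " accuracy =", mean(pred == iris$Species[-idx]), "\n")
}
```

Note that the test set is scaled using the training set's means and standard deviations, mirroring what prep()/bake() do in a recipe.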
```r
library(tidymodels)  # recipes, rsample, parsnip, yardstick, ...
library(tidyverse)   # pivot_longer(), ggplot()
library(kknn)

# centering and scaling data:
pens_recipe <- recipe(sex ~ species + body_mass_g, data = pens_train) %>%
  step_dummy(all_nominal_predictors()) %>%   # one-hot encode species
  step_center(all_numeric_predictors()) %>%  # subtract each column's mean
  step_scale(all_numeric_predictors()) %>%   # divide by each column's SD
  prep()
pens_train_juiced <- juice(pens_recipe)

# visualize processed data:
pens_train_juiced %>%
  pivot_longer(-sex) %>%
  ggplot() +
  geom_histogram(aes(value, fill = sex)) +
  facet_wrap(~name)

# apply the same (trained) preprocessing to the test set:
baked_test <- bake(pens_recipe, new_data = pens_test)
```

Fitting
Metrics:
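As an illustration of common classification metrics, here is a base-R sketch computing accuracy from a confusion matrix; the truth/prediction vectors are invented, and with tidymodels you would use {yardstick} functions such as conf_mat() and accuracy() on real predictions:

```r
# Base-R illustration of classification metrics from a confusion matrix.
# The truth/prediction vectors below are invented for illustration.
truth <- factor(c("f", "f", "f", "m", "m", "m", "m", "m"))
pred  <- factor(c("f", "f", "m", "m", "m", "m", "f", "m"))

cm <- table(truth, pred)            # confusion matrix
cm
accuracy <- sum(diag(cm)) / sum(cm) # correct predictions / all predictions
accuracy                            # 6 correct out of 8 = 0.75
```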
Unsupervised learning has no predictive model: instead it finds previously unknown structure in the data. All variables or features of the data are considered together.
Unsupervised learning tends to be most useful for exploratory data analysis, i.e., prior to having a goal for regression or classification.
Animation demonstrating projection of two features onto a single histogram using principal components analysis. Image source: Amélia O. F. da S. via Wikimedia Commons (CC BY-SA 4.0)
Image source: James et al. 2021
PCA is a recipe step.
Read more here for important details: https://recipes.tidymodels.org/reference/step_pca.html
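A base-R sketch of what PCA computes, using prcomp() on iris for illustration; inside a recipe, the corresponding step is step_pca() applied after centering and scaling:

```r
# Base-R PCA on the four numeric iris columns; in a tidymodels recipe the
# same transformation is step_pca(all_numeric_predictors(), num_comp = 2)
# placed after step_center()/step_scale().
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)        # proportion of variance explained per component
head(pca$x[, 1:2])  # each observation projected onto the first two PCs
```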
Animation of the K-means algorithm in action. After initial random group assignment, centroids are randomly placed and used to classify. Then centroids and assignment are iteratively adjusted until movement stops. Image source: Chire on Wikimedia Commons (CC BY-SA 4.0)
150 observations in 2D space, clustered according to different values of K. Prior to clustering, data are not categorized. Colors indicate which group each observation is assigned to by the model. Image source: James et al. 2021
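The figure's setup (150 observations in 2D space) can be reproduced with base R's kmeans(); {tidyclust}'s k_means() wraps the same algorithm in tidymodels syntax. Two iris columns are used here for illustration:

```r
# Base-R K-means on two features; centroid initialization is random,
# so we set a seed and use multiple random starts.
set.seed(123)
x <- scale(iris[, c("Petal.Length", "Petal.Width")])
fit <- kmeans(x, centers = 3, nstart = 20)  # nstart: 20 random restarts
table(fit$cluster)                          # cluster sizes
fit$tot.withinss                            # total within-cluster sum of squares
```

As in the figure, changing `centers` (the K) changes how the uncategorized observations are grouped; there is no "correct" K in the data itself.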
Hierarchical clustering example. Cutting vertically, at different points along the x axis, will create different numbers of clusters. Image source: Wikimedia Commons
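The cutting idea from the figure maps directly onto base R's hclust() and cutree(); iris is used here for illustration:

```r
# Base-R hierarchical clustering; "cutting" the dendrogram at different
# heights yields different numbers of clusters.
d  <- dist(scale(iris[, 1:4]))       # pairwise Euclidean distances
hc <- hclust(d, method = "complete") # complete-linkage agglomeration
# plot(hc) would draw the dendrogram
cutree(hc, k = 2)[1:5]               # cut into 2 clusters
table(cutree(hc, k = 3))             # cut into 3 clusters instead
```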
Check out the Big Book of R! An online directory at https://www.bigbookofr.com/ of many R ebooks, most of them free OER produced by experts, organized by discipline/topic and searchable.
Look up your discipline (or some topic that interests you, e.g., time series data) and see what applications of R you can find.
Example graphic of a recent update
R 5: Machine Learning Intro